Objective

This project explores a dataset of white wines in which each wine’s chemical properties are shown side by side with a quality rank, the median grade given by professional tasters. The objective is to find out whether there’s a clear relationship between the perceived quality of a wine and its chemical properties.


Data Structure

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The white wines table has 13 variables and 4898 observations.

All input variables (those based on physicochemical tests) are numerical.
The output variable, quality, is an integer.
The X variable is the table index. It’s not useful and may pollute our analysis with unnecessary plots. We will drop it.

ww <- subset(ww, select = -X)

After dropping the X variable, 12 variables remain: wine quality and the 11 chemical properties, all described below.


Variables Description

Input variables (based on physicochemical tests):

  1. fixed acidity (tartaric acid - g / dm3). Most acids involved with wine are fixed or nonvolatile (do not evaporate readily).

  2. volatile acidity (acetic acid - g / dm3 ). The amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste.

  3. citric acid (g / dm3 ). Found in small quantities, citric acid can add ‘freshness’ and flavor to wines.

  4. residual sugar (g / dm3 ). The amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet.

  5. chlorides (sodium chloride - g / dm3 ). The amount of salt in the wine.

  6. free sulfur dioxide (mg / dm3 ). The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.

  7. total sulfur dioxide (mg / dm3 ). Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

  8. density (g / cm3 ). The density of wine is close to that of water, depending on the alcohol and sugar content.

  9. pH. Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.

  10. sulphates (potassium sulphate - g / dm3 ). A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

  11. alcohol (% by volume). The percent alcohol content of the wine.

Output variable (based on qualitative tests):

  12. quality (rated 0-10). The score is the median of the grades given by at least 3 wine experts, where 0 = very bad and 10 = excellent.

Univariate Data Analysis

Data Summary

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

Histograms

Quality

The wine quality histogram has a shape resembling a normal curve, with most wines concentrated between grades 5 and 7. No wine in our dataset got a quality grade worse than 3 or better than 9.

Fixed Acidity

The fixed acidity shows a normal-like distribution, with most values ranging from 5 to 9 g/dm3.

Volatile Acidity

Volatile acidity is right skewed (the bulk of the values sits on the left, with a long tail to the right), with most amounts ranging from 0.15 to 0.40 g/dm3. A log10 transformation helped us better see the data distribution.

Citric Acid

Citric acid is found in small quantities, with most wines ranging between 0 and 0.8 g/dm3. Its curve has a normal-like shape with a few extreme outliers above 0.8 g/dm3.

Residual Sugar

The graph is right skewed, with the most common amounts of residual sugar between 1 and 3 g/dm3. We again benefited from a log10 transformation for better visualization but, contrary to what we expected, it didn’t return a bell shape but rather a bimodal one.
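The log10 view described above can be sketched in base R. The report's own plotting code is not shown, so this is only an illustration under assumptions: the data below is synthetic (a two-component mixture, not the wine table), and the names `sugar` and `h` are made up for the example.

```r
# Synthetic, right-skewed stand-in for residual.sugar (NOT the wine data).
set.seed(42)
sugar <- c(rlnorm(600, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(400, meanlog = log(10),  sdlog = 0.4))

# A long right tail pulls the mean above the median.
mean(sugar) > median(sugar)

# Histogram on the raw scale and on the log10 scale, side by side.
old_par <- par(mfrow = c(1, 2))
hist(sugar, breaks = 40, main = "Raw scale", xlab = "g / dm3")
h <- hist(log10(sugar), breaks = 40, main = "log10 scale",
          xlab = "log10(g / dm3)")
par(old_par)
```

On the raw scale the long tail compresses most of the data into a few bars; on the log10 scale the two clusters of the toy mixture become visible as separate modes.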

Chlorides

The majority of chloride values are concentrated between 0.03 and 0.10 g / dm3 in a normal-like distribution, with some extreme outliers. These outliers keep us from clearly seeing the finer distribution, which was once more revealed through a log10 transformation.

Free Sulfur Dioxide

Free Sulfur Dioxide has a few wild outliers, which prevent us from properly seeing the distribution. For this reason, the x axis was drawn on a log10 scale. The vast majority of values range from 15 to 70 mg / dm3.

Total Sulfur Dioxide

Most white wines have a Total Sulfur Dioxide between 60 and 220 mg / dm3, fairly normally distributed around 130-140 mg / dm3.

Density

As stated before, the density of wine is close to that of water, depending on the alcohol and sugar content. This variable should therefore be highly correlated with those two variables. Most values fall in a narrow range from 0.990 to 0.999 g / cm3.

pH

The pH scale ranges from 0 (very acidic) to 14 (very basic), but all white wines fall in a narrow, normal-like distribution from 2.7 to 3.8.

Sulphates

Sulphates (potassium sulphate) are a wine additive which acts as an antimicrobial and antioxidant agent. Most wines have a concentration between 0.4 and 0.6 g / dm3.

Alcohol

Alcohol content has a relatively wide range: most wines contain from 8.7% to 13%.
Nothing more can be said about it so far, except that fewer wines reach the higher alcohol concentrations.

Key Takeaways

Based on our preliminary examination of individual variables and their value distributions, we noticed that most variables are either normal-like distributed or right skewed. Residual sugar deserves special attention, as it has a bimodal distribution on the log10 scale.

No data cleansing or any other form of data transformation has been performed so far: outliers were kept in the dataset and no new variables were created.


Bivariate Data Analysis

Correlation Matrix

Correlation coefficients fall between -1 and 1: values close to -1 indicate a strong negative correlation, and values close to +1 a strong positive one.

x <= -0.9 | x >= +0.9 –> very strong correlation

-0.9 < x <= -0.7 | +0.9 > x >= +0.7 –> strong correlation

-0.7 < x <= -0.5 | +0.7 > x >= +0.5 –> moderate correlation

-0.5 < x <= -0.3 | +0.5 > x >= +0.3 –> weak correlation

-0.3 < x < +0.3 –> negligible correlation
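The thresholds listed above can be expressed as a small lookup. This is a minimal sketch, not part of the original analysis: `cor_strength` is a hypothetical helper name, and it classifies by absolute value so that negative and positive coefficients share one set of labels.

```r
# Map a correlation coefficient to the strength labels listed above.
# `cor_strength` is a hypothetical helper, not part of the original analysis.
cor_strength <- function(x) {
  stopifnot(is.numeric(x), all(abs(x) <= 1, na.rm = TRUE))
  cut(abs(x),
      breaks = c(0, 0.3, 0.5, 0.7, 0.9, 1),
      labels = c("negligible", "weak", "moderate", "strong", "very strong"),
      right = FALSE,          # intervals are [lower, upper)
      include.lowest = TRUE)  # so that |x| = 1 falls in the last interval
}

as.character(cor_strength(c(0.4, -0.8, 0.2, -0.5, 0.95)))
```

With these bounds a coefficient of exactly 0.5 counts as moderate and one of 0.4 as weak, matching the scale in the text.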

The correlation matrix above shows that no variable has even a moderate direct correlation with quality. Alcohol content comes closest, with a weak correlation of 0.4.

Strong Correlations:
1. Negative correlation between alcohol and density (-0.8);
2. Positive correlation between residual sugar and density (+0.8).

Moderate Correlations:
1. Negative correlation between alcohol and residual sugar (-0.5);
2. Positive correlation between total sulfur dioxide and density (+0.5);
3. Positive correlation between total sulfur dioxide and free sulfur dioxide (+0.6).

Weak Correlations:
1. Negative correlation between alcohol and chlorides (-0.4).
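A rounded correlation matrix like the one summarized above can be produced with base R's `cor()`. The sketch below is an illustration on synthetic data: the `toy` data frame and its coefficients are invented for the example and do not reproduce the wine values. On the real table the equivalent call would presumably be `round(cor(ww), 1)`, assuming every remaining column of `ww` is numeric.

```r
set.seed(1)
n <- 500
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
sugar   <- rexp(n, rate = 1 / 6)
# Toy density: decreases with alcohol, increases with sugar (illustrative only).
density <- 1 - 0.0012 * alcohol + 0.0004 * sugar + rnorm(n, sd = 0.0005)
toy <- data.frame(alcohol, residual.sugar = sugar, density)

# Pearson correlation matrix, rounded to one decimal as in the lists above.
cm <- round(cor(toy), 1)
cm
```

The matrix is symmetric with ones on the diagonal, so only the cells below (or above) the diagonal need to be read.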

Boxplots

Fixed Acidity

Volatile Acidity

High levels of acetic acid can lead to an unpleasant, vinegar taste.
Levels of volatile acidity above 0.36 g / dm3 are rare for an above-average wine.
A level above 0.5 g / dm3 almost certainly indicates a bad wine.

Citric Acid

We can’t say much about citric acid concentration, except that the highest quality wines (very few individual cases) show a higher concentration.

Residual Sugar

We can’t say much about residual sugar concentration. Its level could be tied to its conversion to alcohol, which is a sign of a high quality wine. But it could also mean the grapes had a lot more sugar at the beginning of the process, which leads to no conclusion at all.
We should be very careful with this variable and, if possible, leave it out of our analysis.

Chlorides

There’s a clear trend in the medians: the lower the chloride level, the better the wine. But this trend is not corroborated by the correlation (-0.2). This is probably due to the overlapping interquartile ranges and wide variability.

Free Sulfur Dioxide

Total Sulfur Dioxide

There’s a clear convergence of the best wines towards a range between 100 and 150 mg / dm3. The tendency also favors low concentrations over high ones.

Density

There’s a tendency showing that the lower the density, the better. Highly correlated to alcohol content.

pH

There’s no clear tendency between pH and quality.

Sulphates

Nothing can be said about sulphate concentration.

Alcohol

Alcohol presents a tendency between its concentration and quality. Usually the more alcohol content, the better.
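The median trends read off the boxplots in this section can also be checked numerically. A minimal sketch in base R, using synthetic data in which alcohol is made to drift upward with the grade; the `toy` data frame and the effect size are assumptions for illustration, not the wine data.

```r
set.seed(7)
# Synthetic stand-in: alcohol drifts upward with the quality grade.
quality <- sample(3:9, 400, replace = TRUE)
alcohol <- 8 + 0.4 * quality + rnorm(400, sd = 1)
toy <- data.frame(quality, alcohol)

# One box per grade; the thick bar in each box is the median.
bp <- boxplot(alcohol ~ quality, data = toy,
              xlab = "quality", ylab = "alcohol (% by volume)")

# The same medians as numbers, one per grade.
med <- tapply(toy$alcohol, toy$quality, median)
med
```

Printing the medians next to the plot makes it easier to tell a real trend from one suggested only by a few extreme boxes with very few wines.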

Observed Relationships

So far we have empirically observed that a good wine should have:

  1. High percentage of alcohol
  2. Low density
  3. Low total sulfur dioxide


Doing a quick web search I verified that densitometry is a known method for determining wine alcohol content.
Source: The Australian Wine Research Institute.
As density is strongly (inversely) correlated to alcohol content, it will be dropped from further analysis.


Multivariate Data Analysis

Grouping the Quality Variable


A new variable will be created, grouping the wines into low (3-5), mid (6) and high (7-9) quality. The mid group, represented by quality grade 6, is the median and the mode, sits closest to the mean (5.88), and alone accounts for more wines than either of the other two groups. For this reason it gets a group of its own.

ww$quality.cut <- cut(ww$quality, c(2, 5, 6, 9), labels = c("Low Quality", "Mid Quality", "High Quality"), ordered_result = TRUE) 

summary(ww$quality.cut)
##  Low Quality  Mid Quality High Quality 
##         1640         2198         1060
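To make the interval boundaries explicit: `cut()` builds right-closed intervals by default, so the breaks `c(2, 5, 6, 9)` produce (2,5], (5,6] and (6,9]. The sketch below applies the same call to the seven grades that occur in the data; `grades` and `groups` are illustrative names.

```r
# Every grade present in the dataset, mapped with the same breaks and labels.
grades <- 3:9
groups <- cut(grades, breaks = c(2, 5, 6, 9),
              labels = c("Low Quality", "Mid Quality", "High Quality"),
              ordered_result = TRUE)
data.frame(grades, groups)
```

Grades 3 to 5 land in Low Quality, 6 in Mid Quality and 7 to 9 in High Quality, and the three group counts above add up to the 4898 observations.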

Variables Matrix

Full Dataset

The focus here goes to the scatterplots (left, below diagonal). The intention is to find separate groups by quality (colors) in the intersection between two other variables.
The ones found were:
1. Alcohol x volatile acidity
2. Alcohol x pH

It means volatile acidity and pH need to be analyzed for an indirect effect on quality.

Subsetting the Dataset

At this point we will analyze the relationship between the relevant variables among themselves and also quality.
To make it simpler, a new dataframe containing this subset will be created.

ww_sub <- subset(ww, select = c(volatile.acidity, total.sulfur.dioxide, pH, alcohol, quality.cut))

Subsetted Dataset

After subsetting, the relations became easier to see with the naked eye.

Scatterplots

Note: volatile acidity has a skewed distribution and benefits from a log10 transformation for better visualization.

Alcohol vs. Volatile Acidity

Through the scatterplot we see a higher concentration of low quality wines (red dots) at lower alcohol content, and vice versa. As seen before in the boxplots, alcohol is a desired characteristic.
At every alcohol level, higher volatile acidity is associated with lower wine quality. It’s not a desired characteristic.
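One way to check an "at every alcohol level" statement is to bin alcohol and compare volatile acidity across quality groups within each bin. The following is a sketch on synthetic data with the effect built in by construction; the bin edges, the variable names (`grp`, `va`, `alcohol_bin`, `med_va`) and the colors other than red for low quality are assumptions, not taken from the report.

```r
set.seed(3)
n <- 900
lv <- c("Low Quality", "Mid Quality", "High Quality")
alcohol <- runif(n, 8, 14)
grp <- factor(sample(lv, n, replace = TRUE), levels = lv, ordered = TRUE)
# Toy effect: lower quality groups get higher volatile acidity.
va <- rlnorm(n, meanlog = log(0.35) - 0.15 * (as.integer(grp) - 1), sdlog = 0.25)

# Scatterplot with a log10 y axis, colored by quality group (red = low).
plot(alcohol, va, log = "y", pch = 16,
     col = c("red", "grey50", "blue")[as.integer(grp)],
     xlab = "alcohol (% by volume)", ylab = "volatile acidity (g / dm3)")

# Median volatile acidity per alcohol bin (rows) and quality group (columns).
alcohol_bin <- cut(alcohol, breaks = c(8, 10, 12, 14), include.lowest = TRUE)
med_va <- tapply(va, list(alcohol_bin, grp), median)
round(med_va, 2)
```

If the claim holds, the medians decrease from left to right in every row of the table, not just in the pooled data.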

Alcohol vs. pH

In most cases, higher pH is preferred over low pH, but it’s not a general rule. Low-alcohol white wines get good grades when associated with lower pH (higher acidity).

Alcohol vs. Total Sulfur Dioxide

Nothing can be said about this relation.

Volatile Acidity vs. Total Sulfur Dioxide

In most cases, lower levels of Total Sulfur Dioxide are preferred over high levels.
This relation could not be perceived before. It’s an indirect effect on quality.

Key Takeaways

Observing the most relevant features through multivariate scatterplots, it was possible to closely analyze what was empirically observed through bivariate analysis and the multivariate matrix:

The general desired features in a white wine are:
1. High alcohol content %
2. Low volatile acidity level
3. Low Total Sulfur Dioxide level


Final Plots and Summary

Alcohol content is, on its own, the most relevant feature to explain wine quality. The relationship is clear just from the boxplot, with its steep curve and small range among the highest quality wines (grade 9). When calculated, it showed a 0.4 correlation with quality: still weak, but the highest among all variables.
It’s also clear that something other than alcohol must have gone really wrong with the lowest quality wines (grades 3 and 4, mostly). Despite a rise in alcohol content relative to slightly better wines, the final result was awful.

The variable description already states that high levels of volatile acidity (acetic acid) can lead to an unpleasant, vinegar taste.

As a standalone feature, the influence of volatile acidity can only be perceived in very low quality wines. But seen in conjunction with alcohol content, it’s clear that high volatile acidity is not a desired feature at all.

The variable description notes that SO2 becomes evident in the nose and taste of wine at free SO2 concentrations above 50 ppm; more than 99% of the analyzed wines have a total SO2 concentration above that value. Seen apart from other variables, its effect on quality is inconclusive. But seen in conjunction with volatile acidity, it becomes clear that high total SO2 is not desired.

Summary

Putting it all together, higher alcohol concentration is better than lower concentration. Whatever the alcohol level, lower volatile acidity gives us a better wine. And whatever the volatile acidity level, lower total sulfur dioxide concentration is preferred. Taken together, these three characteristics are the ones most consistently associated with a good white wine in this dataset.


Reflection

Through this exploratory data analysis (EDA), it was possible not only to get a first impression of the dataset, the value ranges of its variables and the relations between them, but also to get a first grasp of the effects of chemical properties on quality.

Through the analysis it also became clear that not every effect can be spotted directly: log transformations and limited value ranges were needed to keep extreme outliers from taking up most of the graph space. Indirect relations were also needed to spot important features. Take total sulfur dioxide (SO2) as an example: first came the relation between alcohol content and quality; second, the effect of volatile acidity at every alcohol level; and last, the effect of SO2 at every volatile acidity level.
The current analysis can be extended to the properties of other variables by using a statistical model such as a decision tree. The dataset could also be used to classify new, unknown wines using machine learning techniques. For now, only the most obvious relations were taken into consideration.


References

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.


Australian Wine Research Institute. Website. 08 Oct. 2017. https://www.awri.com.au/industry_support/winemaking_resources/laboratory_methods/chemical/alcohol/.